13 research outputs found

    Application of Clinical Concept Embeddings for Heart Failure Prediction in UK EHR data

    Electronic health records (EHR) are increasingly being used for constructing disease risk prediction models. Feature engineering in EHR data, however, is challenging due to their high-dimensional and heterogeneous nature. Low-dimensional representations of EHR data can potentially mitigate these challenges. In this paper, we use global vectors (GloVe) to learn word embeddings for diagnoses and procedures recorded using 13 million ontology terms across 2.7 million hospitalisations in national UK EHR. We demonstrate the utility of these embeddings by evaluating their performance in identifying patients who are at higher risk of being hospitalised for congestive heart failure. Our findings indicate that embeddings can enable the creation of robust EHR-derived disease risk prediction models and address some of the limitations associated with manual clinical feature engineering. Comment: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.0721

    Evaluation of data processing pipelines on real-world electronic health records data for the purpose of measuring patient similarity

    BACKGROUND: The ever-growing size, breadth, and availability of patient data allows for a wide variety of clinical features to serve as inputs for phenotype discovery using cluster analysis. Data of mixed types in particular are not straightforward to combine into a single feature vector, and techniques used to address this can be biased towards certain data types in ways that are not immediately obvious or intended. In this context, the process of constructing clinically meaningful patient representations from complex datasets has not been systematically evaluated. AIMS: Our aim was to a) outline and b) implement an analytical framework to evaluate distinct methods of constructing patient representations from routine electronic health record data for the purpose of measuring patient similarity. We applied the analysis on a patient cohort diagnosed with chronic obstructive pulmonary disease. METHODS: Using data from the CALIBER data resource, we extracted clinically relevant features for a cohort of patients diagnosed with chronic obstructive pulmonary disease. We used four different data processing pipelines to construct lower dimensional patient representations from which we calculated patient similarity scores. We described the resulting representations, ranked the influence of each individual feature on patient similarity and evaluated the effect of different pipelines on clustering outcomes. Experts evaluated the resulting representations by rating the clinical relevance of similar patient suggestions with regard to a reference patient. RESULTS: Each of the four pipelines resulted in similarity scores primarily driven by a unique set of features. It was demonstrated that data transformations according to each pipeline prior to clustering can result in a variation of clustering results of over 40%. The most appropriate pipeline was selected on the basis of feature ranking and clinical expertise. 
There was moderate agreement between clinicians as measured by Cohen's kappa coefficient. CONCLUSIONS: Data transformation has downstream and unforeseen consequences in cluster analysis. Rather than viewing this process as a black box, we have shown ways to quantitatively and qualitatively evaluate and select the appropriate preprocessing pipeline
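
The agreement statistic named in the abstract above, Cohen's kappa, compares the observed agreement between two raters with the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' code: the function name, the rating labels and the two rating lists are hypothetical and only illustrate the formula.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is conventionally 1.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings (assumption): two clinicians judge whether each of ten
# suggested patients is a relevant match for a reference patient.
clinician_1 = ["relevant", "relevant", "not relevant", "relevant", "not relevant",
               "not relevant", "relevant", "not relevant", "relevant", "not relevant"]
clinician_2 = ["relevant", "relevant", "not relevant", "not relevant", "not relevant",
               "not relevant", "relevant", "relevant", "relevant", "not relevant"]

kappa = cohens_kappa(clinician_1, clinician_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 8 of 10 items and chance agreement is 0.5, so kappa is about 0.6.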

    Methods for enhancing the reproducibility of biomedical research findings using electronic health records.

    BACKGROUND: The ability of external investigators to reproduce published scientific findings is critical for the evaluation and validation of biomedical research by the wider community. However, a substantial proportion of health research using electronic health records (EHR), data collected and generated during clinical care, is potentially not reproducible mainly due to the fact that the implementation details of most data preprocessing, cleaning, phenotyping and analysis approaches are not systematically made available or shared. With the complexity, volume and variety of electronic health record data sources made available for research steadily increasing, it is critical to ensure that scientific findings from EHR data are reproducible and replicable by researchers. Reporting guidelines, such as RECORD and STROBE, have set a solid foundation by recommending a series of items for researchers to include in their research outputs. Researchers however often lack the technical tools and methodological approaches to actuate such recommendations in an efficient and sustainable manner. RESULTS: In this paper, we review and propose a series of methods and tools utilized in adjunct scientific disciplines that can be used to enhance the reproducibility of research using electronic health records and enable researchers to report analytical approaches in a transparent manner. Specifically, we discuss the adoption of scientific software engineering principles and best-practices such as test-driven development, source code revision control systems, literate programming and the standardization and re-use of common data management and analytical approaches. CONCLUSION: The adoption of such approaches will enable scientists to systematically document and share EHR analytical workflows and increase the reproducibility of biomedical research using such complex data sources
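
As an illustration of the test-driven development practice mentioned in the abstract above, the sketch below pairs a small phenotyping helper with the tests that specify its behaviour. It is an assumption-laden example rather than anything taken from the paper: the function names are hypothetical, and the codelist (a few ICD-10 J44 codes) is illustrative only, not a validated COPD phenotype.

```python
# Illustrative codelist only (assumption): a few ICD-10 J44.x codes.
# This is NOT a validated phenotype definition.
COPD_CODELIST = frozenset({"J44.0", "J44.1", "J44.8", "J44.9"})


def normalise_code(raw_code):
    """Strip surrounding whitespace and upper-case a diagnosis code."""
    return raw_code.strip().upper()


def has_copd_diagnosis(record_codes, codelist=COPD_CODELIST):
    """Return True if any code in the record appears in the codelist."""
    return any(normalise_code(code) in codelist for code in record_codes)


# Tests written as plain functions with assertions, so they can be called
# directly or collected by a test runner such as pytest.
def test_detects_listed_code():
    assert has_copd_diagnosis(["I10", "J44.9"])


def test_ignores_unlisted_codes():
    assert not has_copd_diagnosis(["I10", "E11.9"])


def test_tolerates_messy_input():
    assert has_copd_diagnosis([" j44.1 "])


def test_empty_record_has_no_diagnosis():
    assert not has_copd_diagnosis([])


if __name__ == "__main__":
    for test in (test_detects_listed_code, test_ignores_unlisted_codes,
                 test_tolerates_messy_input, test_empty_record_has_no_diagnosis):
        test()
    print("all tests passed")
```

Keeping the codelist, the function and its tests together under revision control is one way the behaviour of a phenotyping step could be documented and re-checked whenever the code changes.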

    Identifying clinically important COPD sub-types using data-driven approaches in primary care population based electronic health records.

    BACKGROUND: COPD is a highly heterogeneous disease composed of different phenotypes with different aetiological and prognostic profiles and current classification systems do not fully capture this heterogeneity. In this study we sought to discover, describe and validate COPD subtypes using cluster analysis on data derived from electronic health records. METHODS: We applied two unsupervised learning algorithms (k-means and hierarchical clustering) in 30,961 current and former smokers diagnosed with COPD, using linked national structured electronic health records in England available through the CALIBER resource. We used 15 clinical features, including risk factors and comorbidities and performed dimensionality reduction using multiple correspondence analysis. We compared the association between cluster membership and COPD exacerbations and respiratory and cardiovascular death with 10,736 deaths recorded over 146,466 person-years of follow-up. We also implemented and tested a process to assign unseen patients into clusters using a decision tree classifier. RESULTS: We identified and characterized five COPD patient clusters with distinct patient characteristics with respect to demographics, comorbidities, risk of death and exacerbations. The four subgroups were associated with 1) anxiety/depression; 2) severe airflow obstruction and frailty; 3) cardiovascular disease and diabetes and 4) obesity/atopy. A fifth cluster was associated with low prevalence of most comorbid conditions. CONCLUSIONS: COPD patients can be sub-classified into groups with differing risk factors, comorbidities, and prognosis, based on data included in their primary care records. The identified clusters confirm findings of previous clustering studies and draw attention to anxiety and depression as important drivers of the disease in young, female patients

    A molecular dynamics study of the vascular endothelial glycocalyx layer

    The luminal surface of endothelial cells which line the vasculature is coated with a layer of membrane-bound macromolecules of a mixed carbohydrate and protein nature, collectively described as a glycocalyx, from the Greek meaning "sweet husk/covering". Experiments have consistently revealed the pivotal role of the endothelial glycocalyx layer in vasoregulation and the layer's contribution to mechanotransduction pathways. However, the exact mechanism by which the glycocalyx mediates and interprets fluid shear stress remains elusive. This study employs atomic-scale molecular simulation with the aim of investigating the conformational and orientation properties of the highly flexible components of the glycocalyx and their suitability as transduction molecules under hydrodynamic loading. To this aim, two molecular dynamics systems were constructed. The first system focused on the impact of flow on a tethered, branched oligosaccharide. Fluid flow was shown to only moderately affect the conformation populations explored by the oligosaccharide, in comparison to static conditions. On the other hand, the glycan exhibited a significant orientation change, when compared to simple diffusion, aligning itself with the flow direction. The tethered end of the glycan, an asparagine amino acid, experienced conformational changes as a result of this flow-induced bias. Results of the "glycan in flow" model suggest that shear flow through the layer can have an impact on the conformational properties of saccharide-decorated transmembrane proteins, thus probably acting as a mechano-transducer. The second system consisted of charged and non-charged heparan sulfate, a component found in large quantities in the endothelial glycocalyx layer. 
Systems of paired heparan sulfate strands were investigated under conditions of increasing proximity, which is the expected effect of compression of the layer under the effect of flow; a process explored formally using the adaptive biasing force method. This approach warranted the implementation of enhanced sampling for the high energy states of heparan sulfate in close proximity. Results of the heparan sulfate model suggest that areas of locally high charge density within the glycocalyx, generally areas of high sulfation, are characteristically more resistant to compression than non-sulfated areas. The sulfation mix therefore emerges as an important determinant of glycocalyx mechanical properties.

    Evaluation of data processing pipelines on real-world electronic health records data for the purpose of measuring patient similarity.

    BACKGROUND: The ever-growing size, breadth, and availability of patient data allows for a wide variety of clinical features to serve as inputs for phenotype discovery using cluster analysis. Data of mixed types in particular are not straightforward to combine into a single feature vector, and techniques used to address this can be biased towards certain data types in ways that are not immediately obvious or intended. In this context, the process of constructing clinically meaningful patient representations from complex datasets has not been systematically evaluated. AIMS: Our aim was to a) outline and b) implement an analytical framework to evaluate distinct methods of constructing patient representations from routine electronic health record data for the purpose of measuring patient similarity. We applied the analysis on a patient cohort diagnosed with chronic obstructive pulmonary disease. METHODS: Using data from the CALIBER data resource, we extracted clinically relevant features for a cohort of patients diagnosed with chronic obstructive pulmonary disease. We used four different data processing pipelines to construct lower dimensional patient representations from which we calculated patient similarity scores. We described the resulting representations, ranked the influence of each individual feature on patient similarity and evaluated the effect of different pipelines on clustering outcomes. Experts evaluated the resulting representations by rating the clinical relevance of similar patient suggestions with regard to a reference patient. RESULTS: Each of the four pipelines resulted in similarity scores primarily driven by a unique set of features. It was demonstrated that data transformations according to each pipeline prior to clustering can result in a variation of clustering results of over 40%. The most appropriate pipeline was selected on the basis of feature ranking and clinical expertise. 
There was moderate agreement between clinicians as measured by Cohen's kappa coefficient. CONCLUSIONS: Data transformation has downstream and unforeseen consequences in cluster analysis. Rather than viewing this process as a black box, we have shown ways to quantitatively and qualitatively evaluate and select the appropriate preprocessing pipeline

    Methods for enhancing the reproducibility of clinical epidemiology research in linked electronic health records: results and lessons learned from the CALIBER platform

    OBJECTIVES: Electronic health records (EHR) across primary, secondary, and tertiary care are increasingly being linked for research at a population level. The increasing volume, variety, velocity, and veracity of big biomedical data make research reproducibility challenging. Research reproducibility and replicability are essential for the external validity and generalizability of scientific findings, and the lack of standardized approaches and tools and the relative opaqueness of data manipulation methods are detrimental to their integrity. The objective of this study was to explore, evaluate and propose methods, tools and approaches for addressing some of the challenges associated with reproducibility when using linked national electronic health records for research. APPROACH: We systematically searched literature and internet resources for well-established and appropriate methods, tools, and approaches used in related scientific disciplines. The identified techniques were systematically evaluated in terms of their capacity to facilitate reproducible research in routinely collected health data across the life course of a research project: from protocol creation and raw data curation to data transformation and statistical analysis through to the dissemination of findings and impact. Most importantly, the identified techniques were tested and applied in a contemporary database of linked electronic health records. CALIBER is a research data platform of linked national electronic health records from primary care (Clinical Practice Research Datalink), secondary care (Hospital Episode Statistics), acute coronary syndrome disease registry (Myocardial Ischaemia National Audit Project) and cause-specific mortality (Office for National Statistics) for roughly 2 million adults. RESULTS: Firstly, we present the review of methods and approaches which we identified through our search. 
Secondly, we propose a set of recommendations for applying them within the context of research projects making use of linked routinely collected health data. Focal interests included: a) documentation of data (attributes, relationships, and interpretation), b) data processing (source code, instructions, and parameters), c) results (visualizations, figures), and any supplementary material. Thirdly, we present approaches around a) raw data curation using international metadata standards, b) study protocol encoding, c) provenance and sharing of data transformation and statistical analysis operations, d) public and private data retention, and e) computable EHR-driven phenotypes. CONCLUSION: The complexity and size of routinely collected health data is increasing through linkages across distributed data sources. The scientific community benefits from findings which can be replicated. This study presents a number of methods, tools and approaches across the project life course for ensuring that research studies are reproducible and replicable by the wider scientific community

    Evolving treatment patterns and outcomes of neovascular age-related macular degeneration over a decade

    PURPOSE: Management of neovascular age-related macular degeneration (nAMD) has evolved over the last decade with several treatment regimens and different medications. This study describes the treatment patterns and, importantly, visual outcomes over ten years in a large cohort of patients. DESIGN: Retrospective analysis of electronic health records from 27 National Health Service (NHS) secondary care healthcare providers in the UK. PARTICIPANTS: Treatment-naïve patients receiving at least three intravitreal anti-vascular endothelial growth factor (VEGF) injections for nAMD in their first six months of follow-up were included. Patients with missing data for age or gender and those aged less than 55 were excluded. METHODS: Eyes with at least three years of follow-up were grouped by years of treatment initiation, and three-year outcomes were compared between the groups. Data were generated during routine clinical care between 09/2008 and 12/2018. MAIN OUTCOME MEASURES: Visual acuity, number of injections, number of visits. RESULTS: A total of 15,810 eyes of 13,705 patients receiving 194,904 injections were included. Visual acuity (VA) improved from baseline during the first year, but dropped thereafter, resulting in loss of visual gains. This trend remained consistent throughout the past decade. Although an increasing proportion of eyes remained in the driving standard, this was driven by better presenting visual acuities over the decade. The number of injections dropped substantially between the first and subsequent years, from a mean of 6.25 in year 1 to 3 in year 2 and 2.5 in year 3, without improvement over the decade. In a multivariable regression analysis, final VA improved by 0.24 letters for each year since 2008, and younger age and baseline VA were significantly associated with VA at three years. 
CONCLUSION: Our findings show that despite improvement in functional VA over the years, primarily driven by improving baseline VA, patients continue to lose vision after the first year of treatment, with only marginal change over the past decade. The data suggest that these results may be related to suboptimal treatment patterns, which have not improved over the years. Rethinking treatment strategies may be warranted, possibly on a national level or through the introduction of longer-acting therapies